04:00
2026-06-26
arxiv.org
artificial-intelligence
Life After Benchmark Saturation: A Case Study of CORE-Bench
In an arXiv preprint, researchers propose a multi-dimensional framework for evaluating AI agents once benchmark accuracy has saturated, using CORE-Bench Hard as a case study. They introduce CORE-Bench v1.1 and an out-of-dist…